Predictive Dynamic Load Balancing of Parallel Hash-Joins Over Heterogeneous Processors in the Presence of Data Skew

Authors

  • Hasanat M. Dewan
  • Mauricio A. Hernández
  • Kui W. Mok
  • Salvatore J. Stolfo
Abstract

In this paper, we present new algorithms to balance the computation of parallel hash joins over heterogeneous processors in the presence of data skew and external loads. Heterogeneity in our model consists of disparate computing elements, as well as general-purpose computing ensembles that are subject to external loading (e.g., a LAN-connected workstation cluster). Data skew manifests itself as significant non-uniformities in the distribution of attribute values of underlying relations that are involved in a join. We develop cost models and predictive dynamic load balancing protocols to detect imbalance during the computation of a single large join. New predictive bucket scheduling algorithms are presented that smooth out the load over the entire ensemble by reallocating buckets whenever imbalance is detected. Our algorithms can account for imbalance due to data skew as well as heterogeneity in the computing environment. Significant performance gains are reported for a wide range of test cases on a prototype implementation of the system.
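The abstract does not give the cost model or the scheduling protocol, so the following is only a minimal sketch of the general idea of predicting finish times and reallocating buckets when imbalance is detected. It is not the authors' algorithm. Every name here (`Processor`, `rate`, `predicted_finish`, `imbalanced`, `rebalance`, `tolerance`) is hypothetical, and the cost model (remaining tuples divided by an observed processing rate) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical worker; `rate` stands in for an observed tuples/second
    figure that would reflect both hardware speed and external load."""
    name: str
    rate: float
    buckets: dict = field(default_factory=dict)  # bucket id -> remaining tuples

    def predicted_finish(self) -> float:
        # Assumed cost model: time = remaining work / observed rate.
        return sum(self.buckets.values()) / self.rate


def imbalanced(procs, tolerance=0.10):
    """Report imbalance when the spread of predicted finish times exceeds
    `tolerance` times their mean (an assumed detection rule)."""
    times = [p.predicted_finish() for p in procs]
    mean = sum(times) / len(times)
    return mean > 0 and (max(times) - min(times)) > tolerance * mean


def rebalance(procs, tolerance=0.10, max_moves=1000):
    """Greedily move buckets from the slowest-finishing processor to the
    fastest-finishing one while each move lowers the pair's finish time."""
    moves = []
    for _ in range(max_moves):
        if not imbalanced(procs, tolerance):
            break
        src = max(procs, key=Processor.predicted_finish)
        dst = min(procs, key=Processor.predicted_finish)
        before = src.predicted_finish()
        src_load = sum(src.buckets.values())
        dst_load = sum(dst.buckets.values())
        best = None  # (bucket id, resulting finish time of the pair)
        for bid, size in src.buckets.items():
            after = max((src_load - size) / src.rate,
                        (dst_load + size) / dst.rate)
            if after < before and (best is None or after < best[1]):
                best = (bid, after)
        if best is None:
            break  # no single move helps; stop rather than thrash
        bid = best[0]
        dst.buckets[bid] = src.buckets.pop(bid)
        moves.append((bid, src.name, dst.name))
    return moves


# Usage: a slow node holds a skewed bucket (id 4); a fast node is lightly loaded.
fast = Processor("fast", rate=200.0, buckets={0: 1000, 1: 1000})
slow = Processor("slow", rate=100.0, buckets={2: 1000, 3: 1000, 4: 4000})
print(rebalance([fast, slow]))
```

In this sketch a single oversized bucket caused by skew is treated the same way as a slow or externally loaded processor: both show up only as a later predicted finish time, which is what triggers the reallocation.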


Related resources

Pipelined Parallelism in Multi-Join Queries on Heterogeneous Shared Nothing Architectures

Pipelined parallelism has been widely studied and successfully implemented on shared-nothing machines in several join algorithms, under ideal conditions of load balancing between processors and in the absence of data skew. The aim of pipelining is to allow flexible resource allocation while avoiding unnecessary disk input/output for intermediate join results in the treatment of multi-...


An Improved Hash-based Join Algorithm in the Presence of Double Skew on a Hypercube Computer

This paper presents an improved parallel hash-based join algorithm on a hypercube computer in the presence of double skew. We describe a load balancing technique to evenly distribute both join relations across all processors in order to deal with double skew effectively. Moreover, we propose a permutation join method which reduces main memory requirement for the local join operation in the previ...


Efficient Skew Handling for Outer Joins in a Cloud Computing Environment

Outer joins are ubiquitous in many workloads and Big Data systems. The question of how to best execute outer joins in large parallel systems is particularly challenging, as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding prob...


Dynamic Join Product Skew Handling for Hash-Joins in Shared-Nothing Database Systems

When data are uniformly distributed, parallel hash-based join algorithms scale up well. However, the presence of data skew can cause load imbalance among the processors, significantly degrading their performance. In this paper we propose a dynamic skew handling algorithm which deals with this load imbalance by detecting and handling join product skews at run-time. The idea is to monitor the ...


ParLeda: A Library for Parallel Processing in Computational Geometry Applications

ParLeda is a software library that provides the basic primitives needed for parallel implementation of computational geometry applications. It can also be used in implementing a parallel application that uses geometric data structures. The parallel model that we use is based on a new heterogeneous parallel model named HBSP, which is based on BSP and is introduced here. ParLeda uses two main lib...




Publication date: 1994